@melkap01-Arm
Contributor

Key changes

This PR improves the performance of dynamic QGEMMs by implementing tiling and threading across operations.

The changes introduce thread-local buffers for reusing memory during inference, and use them in Dynamic Quantized MatMul operations that run on KleidiAI kernels.

The PR also updates KleidiAI to version 1.15.0.
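
To illustrate the thread-local buffer idea described above, here is a minimal sketch in standard C++. It is not the code from this PR: `GetThreadLocalScratch` is a hypothetical helper name, and the assumption is simply that each worker thread keeps one scratch buffer that grows on demand and is reused across calls instead of being reallocated per inference.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an MLAS or KleidiAI API): returns a scratch buffer
// owned by the calling thread that holds at least `bytes` bytes.
// The buffer only grows, so repeated calls with the same or smaller sizes
// reuse the existing allocation.
inline std::uint8_t* GetThreadLocalScratch(std::size_t bytes) {
    thread_local std::vector<std::uint8_t> scratch;
    if (scratch.size() < bytes) {
        scratch.resize(bytes);
    }
    return scratch.data();
}
```

Because the vector has thread storage duration, no locking is needed: two threads never see the same buffer. The trade-off is that the memory stays allocated for the lifetime of each thread.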

Example performance

Single thread:
[Image: ort_ops_compare_encoder_1_2025-10-02_17-21-32_vs_encoder_1_2025-10-02_16-54-55]

2 threads:
[Image: ort_ops_compare_encoder_2_2025-10-02_17-21-47_vs_encoder_2_2025-10-02_16-55-13]

@melkap01-Arm
Contributor Author

@microsoft-github-policy-service agree company="Arm"

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@patryk-kaiser-ARM
Contributor

Can we get the workflows run, please?

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

General sanity check question: are there enough tests that trigger all the nuances of the multi-threaded implementation? That is, are there enough tests with multiple batch sizes and M and N dimensions to exercise all aspects of it?

@hariharans29
Member

Will trigger CI once you push commits addressing the PR feedback (right now I only see a rebase). Thanks.

@melkap01-Arm
Contributor Author

> General sanity check question: are there enough tests that trigger all the nuances of the multi-threaded implementation? That is, are there enough tests with multiple batch sizes and M and N dimensions to exercise all aspects of it?

We checked the existing qgemm tests. In the current implementation, the tests only run with thread pool = null. We created a follow-up ticket for test coverage.

@hariharans29
Member

> > General sanity check question: are there enough tests that trigger all the nuances of the multi-threaded implementation? That is, are there enough tests with multiple batch sizes and M and N dimensions to exercise all aspects of it?
>
> We checked the existing qgemm tests. In the current implementation, the tests only run with thread pool = null. We created a follow-up ticket for test coverage.

If all the tests run with ThreadPool == null, does that mean the new threadpool-based parallel code path(s) are not exercised?

@melkap01-Arm
Contributor Author

> > > General sanity check question: are there enough tests that trigger all the nuances of the multi-threaded implementation? That is, are there enough tests with multiple batch sizes and M and N dimensions to exercise all aspects of it?
> >
> > We checked the existing qgemm tests. In the current implementation, the tests only run with thread pool = null. We created a follow-up ticket for test coverage.
>
> If all the tests run with ThreadPool == null, does that mean the new threadpool-based parallel code path(s) are not exercised?

It means the parallel path was not exercised in the onnxruntime_mlas_test run, but it is exercised by onnxruntime_perf_test. Unit tests for the multithreaded code have now been added in the latest commit, so both can now use multiple threads.
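
As a rough illustration of the kind of coverage discussed in this thread, the sketch below checks that a threaded run matches a single-threaded run across a small grid of M, N, and K values, including K == 0. It is not the test added in this PR: it uses a plain reference integer GEMM and `std::thread` rather than MLAS or the ONNX Runtime thread pool, and `GemmRows`, `RunGemm`, and `ThreadedMatchesSingle` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Reference integer GEMM for rows [m_begin, m_end): C = A (MxK) * B (KxN).
inline void GemmRows(const std::vector<std::int8_t>& A, const std::vector<std::int8_t>& B,
                     std::vector<std::int32_t>& C, std::size_t K, std::size_t N,
                     std::size_t m_begin, std::size_t m_end) {
    for (std::size_t m = m_begin; m < m_end; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            std::int32_t acc = 0;
            for (std::size_t k = 0; k < K; ++k) {
                acc += static_cast<std::int32_t>(A[m * K + k]) *
                       static_cast<std::int32_t>(B[k * N + n]);
            }
            C[m * N + n] = acc;  // With K == 0 the output is all zeros.
        }
    }
}

// Runs the GEMM with up to `threads` workers, each owning a disjoint row block.
inline std::vector<std::int32_t> RunGemm(const std::vector<std::int8_t>& A,
                                         const std::vector<std::int8_t>& B,
                                         std::size_t M, std::size_t K, std::size_t N,
                                         std::size_t threads) {
    std::vector<std::int32_t> C(M * N, 0);
    if (threads <= 1) {
        GemmRows(A, B, C, K, N, 0, M);
        return C;
    }
    const std::size_t rows_per_thread = (M + threads - 1) / threads;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t begin = std::min(M, t * rows_per_thread);
        const std::size_t end = std::min(M, begin + rows_per_thread);
        if (begin == end) break;  // No rows left for the remaining workers.
        workers.emplace_back([&A, &B, &C, K, N, begin, end] {
            GemmRows(A, B, C, K, N, begin, end);
        });
    }
    for (std::thread& w : workers) w.join();
    return C;
}

// Returns true if the threaded result equals the single-threaded result
// for every shape in a small grid (sizes chosen arbitrarily for illustration).
inline bool ThreadedMatchesSingle(std::size_t threads) {
    const std::size_t dims[] = {1, 3, 16, 33};
    const std::size_t ks[] = {0, 1, 17};
    for (std::size_t M : dims) {
        for (std::size_t N : dims) {
            for (std::size_t K : ks) {
                std::vector<std::int8_t> A(M * K), B(K * N);
                for (std::size_t i = 0; i < A.size(); ++i)
                    A[i] = static_cast<std::int8_t>(static_cast<int>((i * 7) % 255) - 127);
                for (std::size_t i = 0; i < B.size(); ++i)
                    B[i] = static_cast<std::int8_t>(static_cast<int>((i * 13) % 255) - 127);
                if (RunGemm(A, B, M, K, N, 1) != RunGemm(A, B, M, K, N, threads)) return false;
            }
        }
    }
    return true;
}
```

The odd sizes (3, 33, 17) are there so that row blocks do not divide evenly among workers, which is where partitioning bugs tend to show up.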

unused variable removed,
unnecessary temp_tile use and copy removed,
K==0 case checked

Signed-off-by: melkap01 <[email protected]>
@JonathanC-ARM JonathanC-ARM force-pushed the melkap01_implement_mt_qgemm branch from 64c59e5 to 3fcba09 Compare January 15, 2026 12:03
@JonathanC-ARM JonathanC-ARM force-pushed the melkap01_implement_mt_qgemm branch from 27d570c to 0c60f4e Compare January 15, 2026 12:18
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

This PR implements multithreading and tiling for Dynamic Quantized GEMM operations using KleidiAI kernels to improve performance on ARM64 SME/SME2 architectures. The changes introduce thread-local buffers for memory reuse during inference and update KleidiAI to version 1.15.0.

Changes:

  • Refactored dynamic quantization matrix multiplication to use thread-local buffers and parallel tiling across batch, M, and N dimensions
  • Moved KleidiAI packing logic from operator-specific code to a reusable base class
  • Extended test coverage to include single-threaded and multi-threaded test suites with edge cases
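
One way to picture "parallel tiling across batch, M, and N dimensions" is to flatten all tiles into a single index space that a thread pool can hand out, then decode each index back into a batch and an output sub-block. The sketch below shows that decoding under stated assumptions: `TileRange`, `TileCount`, and `DecodeTile` are hypothetical names, the tile sizes are free parameters, and the actual partitioning used in the PR may differ.

```cpp
#include <algorithm>
#include <cstddef>

// Describes the output sub-block one task is responsible for.
struct TileRange {
    std::size_t batch;
    std::size_t m_start, m_count;
    std::size_t n_start, n_count;
};

inline std::size_t CeilDiv(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// Total number of independent tasks. Returns 0 when any dimension is empty,
// so callers never invoke DecodeTile with M == 0 or N == 0.
inline std::size_t TileCount(std::size_t batch, std::size_t M, std::size_t N,
                             std::size_t tile_m, std::size_t tile_n) {
    return batch * CeilDiv(M, tile_m) * CeilDiv(N, tile_n);
}

// Maps a flat task index to (batch, M range, N range).
// Precondition: tile_index < TileCount(...), which implies M > 0 and N > 0.
inline TileRange DecodeTile(std::size_t tile_index, std::size_t M, std::size_t N,
                            std::size_t tile_m, std::size_t tile_n) {
    const std::size_t tiles_n = CeilDiv(N, tile_n);
    const std::size_t tiles_per_batch = CeilDiv(M, tile_m) * tiles_n;
    const std::size_t in_batch = tile_index % tiles_per_batch;

    TileRange r;
    r.batch = tile_index / tiles_per_batch;
    r.m_start = (in_batch / tiles_n) * tile_m;
    r.n_start = (in_batch % tiles_n) * tile_n;
    // Edge tiles are clipped so they never run past M or N.
    r.m_count = std::min(tile_m, M - r.m_start);
    r.n_count = std::min(tile_n, N - r.n_start);
    return r;
}
```

For example, with batch = 2, M = 5, N = 7 and 4x4 tiles there are 8 tasks, and the last one covers batch 1, one row starting at row 4, and three columns starting at column 4.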

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/mlas/unittest/test_dynamic_qgemm.cpp | Splits tests into single-thread and thread-pool variants, adds proper quantization simulation and edge case handling |
| onnxruntime/test/contrib_ops/dynamic_quantize_matmul_test.cc | Adds KleidiAI-specific tests for bias handling, zero-point validation, and fallback scenarios |
| onnxruntime/core/providers/cpu/quantization/matmul_integer_base.h | Extracts KleidiAI prepacking logic into reusable helper methods in the base class |
| onnxruntime/core/mlas/lib/qgemm.cpp | Updates availability check to include both SME and SME2 |
| onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp | Implements multi-threaded tiling with thread-local buffers and adds input validation |
| onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h | Adds UseSME flag alongside existing UseSME2 |
| onnxruntime/contrib_ops/cpu/quantization/dynamic_quantize_matmul.cc | Simplifies by delegating prepacking to base class and removes duplicate code |

@hariharans29
Member

Please rebase with main and the CUDA / TensorRT issues should go away

@hariharans29
Member

May have some conflicts with #26849

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29 hariharans29 enabled auto-merge (squash) January 16, 2026 17:09
@hariharans29 hariharans29 merged commit 4013dc1 into microsoft:main Jan 16, 2026
88 checks passed